Abstract

We are facing a real challenge when coping with the continuous acceleration of 
scientific production and the increasingly changing nature of science. In this arti- 
cle, we extend the classical framework of co-word analysis to the study of scientific 
landscape evolution. Capitalizing on formerly introduced science mapping meth- 
ods with overlapping clustering, we propose methods to reconstruct phylogenetic 
networks from successive science maps, and give insight into the various dynamics 
of scientific domains. Two indexes, the pseudo-inclusion and the empirical quality,
are introduced to qualify scientific fields and are used to validate the
reconstruction. Phylogenetic dynamics appear to be strongly correlated with these two
indexes and, to a weaker extent, with a third one introduced previously (the density
index). These results suggest that there exist regular patterns in the "life cycle" of
scientific fields. The reconstruction of science phylogeny should improve our global
understanding of science evolution and pave the way toward the development of
innovative tools for our daily interactions with its productions. In the long run,
these methods should bring quantitative epistemology to the point where theoretical
models of science evolution can be corroborated or falsified through large-scale
phylogeny reconstruction from databases of scientific literature.

Keywords: science dynamics — co-word analysis — phylogeny — reconstruction 

We are facing a real challenge when coping with the increasingly changing nature
of science. First, the millions of papers published every year make it clearly impossible
for anybody to be aware of all the important breakthroughs and developments in
science. This issue is made even more critical by the continuous acceleration of scientific
production, which threatens every scholar with information overload (the volume of
publications per year has doubled over the last 12 years). Second, although science is not
carved in marble and would better be defined as an ever-changing enterprise [ ] , a lively debate
has been taking place for more than 10 years around the shift toward a new regime of
knowledge production following the transformation of the nature of the research process.

*Both authors have equally contributed to this work. 
According to [ ], science has recently entered a new mode, where knowledge is generated
within a wider context of application, giving full place to trans-disciplinarity, defined
as the circulation of tools, theoretical perspectives, and people. Whatever the causes of
such transformations, the frontiers of science indeed appear to be changing ever faster
and becoming blurred as fields and sub-fields cross-fertilize, grow or die. There
is an urgent need to map these fluctuating landscapes.

Science mapping is one of the aims of scientometrics, a young science that took off 
in the late seventies, fostered by the development of electronic scientific databases and 
the increasing power of computers. Data-mining methods (in the wide sense) have been
developed that make it possible to identify patterns, or meso-level structures, in scientific
corpora that make sense to us (e.g. scientific fields or epistemic fields). The articulations
between these scientific fields are then displayed on science maps to give overviews of
scientific domains.

Part of the utility of science maps, for theorists (science studies, history and
philosophy of science), users (scientists) and policy makers alike, comes from their capacity
to give meaning to the evolution of science: what are the emergent fields, the continuities
and the main paradigmatic shifts, and from which scientific fields does a new field inherit its
intellectual background? There is thus an important concern about reconstructing these
dynamics in such a way that fields of knowledge could be tracked through time. From 
the theoretical point of view, this entails that the core object in the representation of 
the evolution of science is a phylogenetic network while most scientometrics studies focus 
on science snapshots. In this article, we will show that co-word analysis is a suitable 
approach from this perspective and propose methods for an automated reconstruction 
of science phylogenies. The core question is: How can we reconstruct science dynamics 
through automated bottom-up analysis of scientific publications? 

1 Science mapping 

A large proportion of science maps are built upon co-occurrence data, with the
assumption that the more often two elements co-occur in the same article, the more they are
related, and the closer they should appear on the map. These co-occurrence data can be
of different natures: co-authorship networks [ ], co-citation networks [ ] or co-word
networks ([3], [4]). In what follows, we focus on the latter, in the framework of
co-word analysis. In this approach, co-occurrences of terms are indexed in large corpora.
A graph structure is then generated, where nodes represent the terms and the strength of
a link represents the alleged similarity of the two terms it connects. This similarity
measure is computed from co-occurrence data. Higher-level structures reflecting domains
of science are then derived by analyzing patterns in this graph with clustering methods.
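
The indexing step can be sketched in a few lines of Python. This is a minimal illustration, not the indexing pipeline used in this study: the toy corpus and the function name count_cooccurrences are hypothetical, and each article is assumed to be already reduced to the list of terms it mentions.

```python
from collections import Counter
from itertools import combinations

def count_cooccurrences(articles):
    """Return (n_i, n_ij): articles per term and articles per unordered term pair."""
    term_counts = Counter()
    pair_counts = Counter()
    for terms in articles:
        unique = sorted(set(terms))                   # count a term once per article
        term_counts.update(unique)
        pair_counts.update(combinations(unique, 2))   # pairs are stored in sorted order
    return term_counts, pair_counts

# Hypothetical toy corpus: one list of indexed terms per article.
articles = [
    ["apoptosis", "cancer", "checkpoint"],
    ["cancer", "dna damage"],
    ["apoptosis", "cancer"],
]
n_i, n_ij = count_cooccurrences(articles)
```

The two counters play the role of the term counts and co-occurrence counts from which all the similarity measures discussed below are computed.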

Scientometrics has defined a great number of measures based on co-occurrence data
that capture the degree of similarity or proximity between two terms (cf. [ ] for a good
review). Among others, we can mention two indexes that were introduced early in
scientometrics: the inclusion index $\frac{n_{ij}}{\min(n_i, n_j)}$ and the proximity index
$\frac{n_{ij}^2}{n_i n_j}$. Here, $n_i$ (respectively $n_j$ and $n_{ij}$) is the number of articles
mentioning the term i (respectively j and both i and j).

Further measures were later introduced. However, most of them, by summarizing the
relation between two terms with a single number, fail to convey important information
about their use: given two terms i and j, is one more specific or more generic than the
other? Is i more specific in the sense that it tends to be used by a sub-community of the
community using j?
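
To make this limitation concrete, the sketch below computes the two classical indexes for a hypothetical pair of terms. It assumes the usual forms n_ij / min(n_i, n_j) for the inclusion index and n_ij^2 / (n_i n_j) for the proximity index; the counts are invented for illustration.

```python
def inclusion_index(n_i, n_j, n_ij):
    """Share of the rarer term's articles that also mention the other term."""
    return n_ij / min(n_i, n_j)

def proximity_index(n_i, n_j, n_ij):
    """Squared co-occurrence count normalized by both term frequencies."""
    return n_ij ** 2 / (n_i * n_j)

# Hypothetical counts: i is a frequent term, j a rare one always used with i.
n_i, n_j, n_ij = 100, 10, 10
incl_ij = inclusion_index(n_i, n_j, n_ij)
incl_ji = inclusion_index(n_j, n_i, n_ij)
prox_ij = proximity_index(n_i, n_j, n_ij)
prox_ji = proximity_index(n_j, n_i, n_ij)
```

Both indexes return the same value whichever term comes first, so neither can tell that j is the more specific term of the pair.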

We assume that the asymmetrical relation between terms is essential information
for gaining insight into the overall structure of science (fields and subfields). It can be
captured by an appropriate choice of proximity measure, such as the pseudo-inclusion measure
defined over a period T by 1 :

$$P^T_\alpha(i,j) = \left( \left(\frac{n^T_{ij}}{n^T_i}\right)^{\alpha} \left(\frac{n^T_{ij}}{n^T_j}\right)^{1/\alpha} \right)^{\min(\alpha,\, 1/\alpha)}.$$

This measure has the advantage of conveying information about the relative position of
two terms from the point of view of their use: terms j such that $P^T_\alpha(i,j)$ is close to 1
will contextualize i for $\alpha \gg 1$ and will tend to be more specific in their use relative to
i for $0 < \alpha \ll 1$ (see [ ] for more details) 2 .
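
A short numerical sketch may help to see the asymmetry at work. It assumes the measure has the form ((n_ij/n_i)^alpha (n_ij/n_j)^(1/alpha))^min(alpha, 1/alpha); the counts and the value alpha = 3 are hypothetical.

```python
def pseudo_inclusion(n_i, n_j, n_ij, alpha):
    """Pseudo-inclusion of term i with respect to term j (assumed form, see lead-in)."""
    if n_ij == 0:
        return 0.0
    base = (n_ij / n_i) ** alpha * (n_ij / n_j) ** (1.0 / alpha)
    return base ** min(alpha, 1.0 / alpha)

# Hypothetical counts: i is rare (10 articles), j is common (100 articles),
# and 9 of the 10 articles mentioning i also mention j.
n_i, n_j, n_ij = 10, 100, 9
p_ij = pseudo_inclusion(n_i, n_j, n_ij, alpha=3.0)    # high: j contextualizes i
p_ji = pseudo_inclusion(n_j, n_i, n_ij, alpha=3.0)    # low: i does not contextualize j
p_sym = pseudo_inclusion(n_j, n_i, n_ij, alpha=1 / 3.0)
p_one = pseudo_inclusion(n_i, n_j, n_ij, alpha=1.0)
```

With these counts p_ij is much larger than p_ji, p_sym coincides with p_ij (swapping the terms while inverting alpha leaves the value unchanged), and for alpha = 1 the measure reduces to the symmetric ratio n_ij^2 / (n_i n_j).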

The pseudo-inclusion measure also enables a natural representation of the internal
structure of a cluster C. To each term w in C, two coordinates $(I^\alpha_s(w), I^\alpha_g(w))$ can be
assigned to qualify its degree of specificity and genericity relative to the other terms in
C. The specificity index indicates to what extent w is specific to C and is defined by:
$I^\alpha_s(w) = \frac{1}{card(C)} \sum_{w' \in C} P_{\max(\alpha, 1/\alpha)}(w, w')$. The genericity index indicates to what extent
a term w contextualizes C. It is defined by: $I^\alpha_g(w) = \frac{1}{card(C)} \sum_{w' \in C} P_{\max(\alpha, 1/\alpha)}(w', w)$.
With this representation, the labeling of each cluster finds a natural solution, since each
of its components is characterized on a specificity / genericity scale. Depending on what
one is looking for, one can label a cluster with its most generic terms, its most specific
ones, and so on (see [ ] for more details).
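
The labeling procedure can be illustrated as follows. The sketch assumes that the specificity of w averages the pseudo-inclusion of w with respect to each term of the cluster, that the genericity averages the pseudo-inclusion of each term with respect to w, and that both sums include w itself; the counts, the three-term cluster and the helper names are hypothetical.

```python
def pseudo_inclusion(n_i, n_j, n_ij, alpha):
    """Assumed form: ((n_ij/n_i)^alpha * (n_ij/n_j)^(1/alpha))^min(alpha, 1/alpha)."""
    if n_ij == 0:
        return 0.0
    base = (n_ij / n_i) ** alpha * (n_ij / n_j) ** (1.0 / alpha)
    return base ** min(alpha, 1.0 / alpha)

def term_indexes(cluster, n, n_pair, alpha):
    """Map each term of the cluster to its (specificity, genericity) pair."""
    a = max(alpha, 1.0 / alpha)

    def cooc(u, v):
        # A term trivially co-occurs with itself in all of its articles.
        return n[u] if u == v else n_pair.get(frozenset((u, v)), 0)

    indexes = {}
    for w in cluster:
        spec = sum(pseudo_inclusion(n[w], n[v], cooc(w, v), a) for v in cluster)
        gen = sum(pseudo_inclusion(n[v], n[w], cooc(w, v), a) for v in cluster)
        indexes[w] = (spec / len(cluster), gen / len(cluster))
    return indexes

# Hypothetical counts for a three-term cluster.
n = {"cancer": 100, "apoptosis": 20, "checkpoint": 10}
n_pair = {
    frozenset(("cancer", "apoptosis")): 18,
    frozenset(("cancer", "checkpoint")): 9,
    frozenset(("apoptosis", "checkpoint")): 5,
}
indexes = term_indexes(list(n), n, n_pair, alpha=3.0)
most_generic = max(indexes, key=lambda w: indexes[w][1])
most_specific = max(indexes, key=lambda w: indexes[w][0])
```

On this toy cluster the frequent term "cancer" comes out as the most generic label and the rare term "checkpoint" as the most specific one.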

Starting from a set of terms C to be mapped (see the material and methods for
the selection of terms and their indexation), the pseudo-inclusion measure transforms
the co-occurrence matrix into an asymmetric proximity matrix $\mathcal{P}_\alpha$. This matrix defines
a directed weighted graph on C that can be further analyzed with clustering methods
to detect informative patterns. In our case, patterns will represent domains of science
defined by sets of strongly related terms that contextualize each other's meaning, some
being more specific, others more generic. These sets will hereafter be called scientific
fields.

Several clustering methods have been proposed in the literature and extensively tested
for science mapping, e.g. k-means clustering ([ ],[]), Self-Organizing Maps [22], and
methods based on information flows [19]. However, terms can be used by different scientific
communities with different meanings. This implies that some terms could belong to different
scientific fields, a fact which technically requires the use of clustering methods that allow
clusters to overlap 3 . In order to keep the information conveyed by the asymmetry of
$\mathcal{P}_\alpha$ and allow clusters to overlap, we chose the detection of directed cliques [ ] as the
basis for our clustering algorithm. Extraction of directed cliques is one of the recent and
convincing algorithms that produce overlapping clusters on directed graphs. In what follows,
the set of directed cliques (or scientific fields) is denoted $\mathcal{C} = \{C_i\}_{i \in I}$.

After this first clustering operation, the next step is to give an insight into the artic- 
ulation of the different scientific fields to provide a global view of the scientific landscape 
covered by C. 

The pseudo-inclusion measure $P_\alpha$ can naturally be extended to a proximity between
clusters at period T by averaging the proximity between the terms of two clusters:

$$P^T_\alpha(C_a, C_b) = \frac{1}{|C_a|} \sum_{i \in C_a} \left( \frac{1}{|C_b|} \sum_{j \in C_b} P^T_\alpha(i,j) \right)$$

1 $n^T_i$ (resp. $n^T_j$ and $n^T_{ij}$) is the number of articles mentioning the term i (resp. j and
both i and j) over the period T.

2 Note that $P_\alpha(i,j) = P_{1/\alpha}(j,i)$, so that if j specifies i, i contextualizes j. Moreover,
$\lim_{\alpha \to \infty} P_\alpha(i,j)$ is the inclusion measure over the sets of papers mentioning i and j.

3 [22] allows for the same label to belong to several knowledge domains, yet SOM methods used to 
categorize abstracts indeed perform a partitioning. 
It is important to note that two clusters can be close with respect to $P^T_\alpha$ even if they
do not share any terms, as long as the terms they contain are related.
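
The averaging can be sketched as follows, taking the term-level proximity as a given function. The two clusters, the proximity values and the helper lookup are hypothetical; they are chosen so that the clusters share no term.

```python
def cluster_proximity(cluster_a, cluster_b, term_proximity):
    """Average over i in cluster_a of the mean proximity of i to the terms of cluster_b."""
    total = 0.0
    for i in cluster_a:
        total += sum(term_proximity(i, j) for j in cluster_b) / len(cluster_b)
    return total / len(cluster_a)

# Hypothetical term-level proximities between two disjoint clusters.
toy = {
    ("a1", "b1"): 0.8, ("a1", "b2"): 0.4, ("a2", "b1"): 0.6, ("a2", "b2"): 0.2,
    ("b1", "a1"): 0.1, ("b1", "a2"): 0.1, ("b2", "a1"): 0.3, ("b2", "a2"): 0.3,
}

def lookup(i, j):
    return toy.get((i, j), 0.0)

forward = cluster_proximity(["a1", "a2"], ["b1", "b2"], lookup)
backward = cluster_proximity(["b1", "b2"], ["a1", "a2"], lookup)
```

Here forward is about 0.5 and backward about 0.2: the two clusters are related although they have no term in common, and the relation keeps the asymmetry of the term-level measure.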

$P^T_\alpha$ defines a weighted directed graph on the set of clusters that can be mapped with
network visualization tools. Automatic cluster labeling can profitably be exploited to
further simplify the map by merging clusters with the same labels. Depending on the labeling
chosen (specific labels, generic, etc.) and the number of labels per cluster, visualizations
will display different viewpoints on the scientific domain under study, with different
resolutions.

2 Validation 

As stated before, the aim of phylogeny reconstruction is to discover patterns and reg- 
ularities in science evolution. Given this objective, we defined two benchmarks for this 
reconstruction: theoretical validation and empirical validation. 

Theoretical validation is related to the robustness of the detected patterns with respect to
the dataset ([ ]) and the parameters of the model ($d_0$ in our case). Detected patterns
should be robust to parameter changes if we want them to be significant.

Empirical validation is related to the adequacy of the reconstruction of scientific fields
compared to the actual productions of scientific communities. To reflect the activity of a
scientific community, it is important that scientific fields be composed of terms that are
indeed mentioned together in the literature. The principle of the proposed empirical
validation is thus to check, for each cluster, that there is a significant number of
papers mentioning all the terms of the cluster in their full text. Moreover, a cluster
composed of very common terms (e.g. disease, molecule, cell, division) is not as
informative as a cluster composed of more specific terms (e.g. cancer, dna damage,
apoptosis, checkpoint). This nuance can be captured by the notion of self-information
[ ] conveyed by the observation of an event composed of independent items $a_1 \ldots a_n$
which have probabilities $p_1 \ldots p_n$ of being observed individually. Self-information is then
defined by $I(a_1, \ldots, a_n) = \sum_{i=1}^{n} -\log(p_i)$. These two constraints can be synthesized
into the empirical quality of a cluster C, defined as the product of its self-information
with the normalized number $\frac{n_C}{N}$ of papers mentioning all the terms of C in their full text:
$Q_e(C) = \frac{n_C}{N} \sum_{i \in C} -\log\left(\frac{n_i}{N}\right)$, where N is the total number of papers in the reference
corpus. The empirical quality could be used as a parameter to filter phylogenies so as to
display the most relevant scientific fields.
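
The following sketch illustrates why the empirical quality favors clusters of specific terms. It assumes that the probability of a term is estimated by its share of papers in the corpus and that the logarithm is natural (the base is not stated here); all counts are hypothetical.

```python
import math

def self_information(probabilities):
    """Self-information of jointly observing independent items."""
    return sum(-math.log(p) for p in probabilities)

def empirical_quality(term_counts, n_all_terms, n_corpus):
    """Self-information of the terms, weighted by the share of papers mentioning all of them."""
    info = self_information(n / n_corpus for n in term_counts)
    return (n_all_terms / n_corpus) * info

N = 1000  # hypothetical corpus size
# Two hypothetical four-term clusters, each fully mentioned by 10 papers.
q_common = empirical_quality([500, 400, 300, 200], n_all_terms=10, n_corpus=N)
q_specific = empirical_quality([50, 40, 30, 20], n_all_terms=10, n_corpus=N)
q_unattested = empirical_quality([50, 40, 30, 20], n_all_terms=0, n_corpus=N)
```

For the same number of supporting papers, the cluster made of rarer terms receives the higher quality, and a cluster that no paper mentions in full receives a quality of zero.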

3 Qualifying clusters 

Relevance is not a binary judgment but rather lies on a continuum, potentially
multidimensional, reflecting what is looked for: well-recognized domains of investigation,
emergent domains, highlights on interdisciplinary domains, etc. Empirical quality is one
of the indexes that make it possible to qualify identified scientific fields. Furthermore,
we studied two other indexes that help to give meaning to science evolution.

• Density. One of the first indexes introduced to assess the evolution of scientific fields is
the density of a field [ ]. "It characterizes the strength of the links that tie the
words making up the cluster together. The stronger these links are, the more the
research problems corresponding to the cluster constitute a coherent and integrated
whole. It could be said that density provides a good representation of the cluster's
capacity to maintain itself and to develop over the course of time in the field under
consideration." It is computed by: $D(C) = \frac{1}{card(C)} \sum_{(w,w') \in C,\, w \neq w'} P_1(w, w')$.
• Pseudo-inclusion index. Since our goal is to find clusters where all terms are
satisfying contexts or are well contextualized by other terms in the cluster, we defined
the pseudo-inclusion index of a cluster: $I^\alpha(C) = \min_{w \in C} \frac{1}{2}\left(I^\alpha_s(w) + I^\alpha_g(w)\right)$. This
index indicates the degree of structuration of C. Clusters with a low pseudo-inclusion
index have at least one term that does not fit well with the other terms, being neither
specific nor generic. As we shall see, the pseudo-inclusion index opens some perspectives
for the interpretation of science dynamics.
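
As an illustration of the second index, the sketch below takes precomputed (specificity, genericity) pairs and returns the score of the worst-fitting term. The normalization by one half is an assumption that does not affect which term is identified as the weakest; the values and the function name are hypothetical.

```python
def cluster_pseudo_inclusion(term_indexes):
    """Return (index, weakest term) from a mapping term -> (specificity, genericity)."""
    scores = {w: 0.5 * (s + g) for w, (s, g) in term_indexes.items()}
    weakest = min(scores, key=scores.get)
    return scores[weakest], weakest

# Hypothetical per-term indexes: "network" is neither specific nor generic here.
indexes = {
    "cancer": (0.42, 0.81),
    "apoptosis": (0.66, 0.54),
    "network": (0.15, 0.20),
}
index, weakest = cluster_pseudo_inclusion(indexes)
```

A single poorly integrated term is enough to pull the index of the whole cluster down, which is what makes the index a measure of structuration rather than of average relatedness.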

Along with empirical quality, these two indexes will be useful to filter science maps 
and focus on some particular parts of the phylogeny. Note that whereas pseudo-inclusion 
and density can be computed without supplementary information, empirical quality 
needs additional queries to a corpus database. One issue will thus be to see the extent 
to which it is possible to use the first two indexes as proxies to evaluate the empirical 
quality. 

4 Tracking meso-dynamics 

One of the most essential features of the evolution of science is the way in which new
associations between terms are formed and change the composition of scientific fields.
These changes in the use of terms are the main visible evidence of shifts in scientific
activity. Sets of terms are the adequate level at which to study cross-fertilization of
different fields of science, circulation of concepts through domains, bursts of activity in a
given branch, and so on. They are widely used by scientists to define, with a few keywords,
their research, the topics of a journal or the scope of a conference. We will call the dynamics
of science studied at the level of sets of terms the meso-dynamics of science. Reconstructing
these meso-dynamics is equivalent to finding a matching function between the clusters of
science maps from successive periods of time.

The answer to this problem is far from straightforward. A scientific field, represented
by a cluster C at a given period of time, can undergo several kinds of transformation in
its composition that will entail a different representation in the next periods: it can gain
new concepts, lose others, merge with another field, split or die. Consequently, two
successive maps can have very different sets of scientific fields. However, even if scientific
fields were all different between two periods, they could nevertheless share some terms
and potentially share a common scientific background. A scientific field can have several
"offspring" in the next period and its conceptual legacy may come from several domains
of investigation from the previous period.

The reconstruction of these inheritance patterns will be very useful to get a global
overview of the activity and evolution of large scientific domains. Moreover, contrary
to what is often encountered in biology, we should expect some hybridization events
between fields of research, which requires switching from phylogenetic trees to phylogenetic
networks. Reconstructing the phylogenetic network of science consists in answering this
simple question: given a scientific field $C^{T'}$ at period T' and a period T prior to T', from
which fields at T does $C^{T'}$ derive its conceptual legacy?

To achieve inter-temporal matching between fields, we have to find, for each field at
T', the field or union of fields at T from which it inherits. We assume that the time scale of
the evolution of scientific fields is slow enough to allow simple similarity measures between
two close periods to track the meso-dynamics of a given field. We thus seek to find
the field or combination of fields that is most similar and therefore the most likely
matchable. One of the most straightforward measures is the Jaccard similarity measure 4
on the terms of the fields, hereafter denoted d. Given two fields $C_1$ and $C_2$,
$d(C_1, C_2) = \frac{|C_1 \cap C_2|}{|C_1 \cup C_2|}$. d can be interpreted in terms of the probability that a term
belonging to $C_1 \cup C_2$ also belongs to $C_1 \cap C_2$. This is simply a measure of the overlap
between $C_1$ and $C_2$.

4 This function is the inverse of the "transformation index" introduced for similar purposes by Callon
in [2]

Figure 1: Inter-temporal fields matching.
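
In code, the Jaccard similarity between two fields represented as sets of terms is a one-liner. The two toy fields below are hypothetical.

```python
def jaccard(field_1, field_2):
    """Share of the terms of either field that belong to both fields."""
    field_1, field_2 = set(field_1), set(field_2)
    union = field_1 | field_2
    return len(field_1 & field_2) / len(union) if union else 0.0

# Two hypothetical fields sharing two of their four distinct terms.
similarity = jaccard({"cancer", "apoptosis", "checkpoint"},
                     {"apoptosis", "checkpoint", "cell cycle"})
```

The result is 0.5: two shared terms out of four distinct terms.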

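Given a conceptual field $C_i^{T'} \in \{C_b^{T'}\}_{b \in B}$ at time T', we propose to perform
inter-temporal matching by choosing its "fathers" $\Phi^{T'}(C_i^{T'})$ among the set of paradigmatic
fields of the previous period $\{C_a^T\}_{a \in A}$ as:

$$\Phi^{T'}(C_i^{T'}) = \arg\max_{K \subset A} \; d\Big(\bigcup_{k \in K} C_k^T,\; C_i^{T'}\Big)$$

With the Jaccard similarity measure we can write:

$$\Phi^{T'}(C_i^{T'}) = \arg\max_{K \subset A} \frac{\left|\left(\bigcup_{k \in K} C_k^T\right) \cap C_i^{T'}\right|}{\left|\left(\bigcup_{k \in K} C_k^T\right) \cup C_i^{T'}\right|}$$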

Figure 1 illustrates the matching procedure. We plotted two successive sub-networks
with the same set of nodes at two time steps. The two successive periods present
distinct cluster sets: A and B at time t, and C and D at time t + 1. Note that one node
belongs to two different clusters at time t + 1. The aim is to determine from which fields
or union of fields C and D may descend. It is straightforward to check that field A
is the closest to cluster C (i.e. $\Phi^{t+1}(C) = A$). Even though two nodes were removed from A
while one node was added, the similarity d(A, C) between A and C is still the best
possible and offers the best matching. The case of D is more delicate since three cases
are possible: D may inherit from A, B or A ∪ B. Comparing the similarities d(D, A),
d(D, B) and d(D, A ∪ B) for these three cases, the largest is d(D, A ∪ B). We thus
conclude that D most likely inherits from the merging of the two preceding fields A and
B, i.e. $\Phi^{t+1}(D) = A \cup B$.

Since it would seem incorrect to match two fields that have very few terms in common
even when no better matching is possible, we need to define a threshold above which
the matching is considered satisfactory. We shall call this threshold $d_0$. One can tune this
threshold to require a minimum amount of similarity. As we shall see, activity patterns in the
phylogeny (areas of activity burst, areas with emergent fields, branch deaths, etc.) are
robust to variations of $d_0$, provided that $d_0$ does not get too close to 0 or 1.
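
The whole matching step, including the threshold, can be sketched as an exhaustive search over unions of fields from the previous period. This is an illustration under stated assumptions rather than the implementation used in this study: the cap max_union_size on the number of merged fathers is a hypothetical safeguard against combinatorial explosion, and the toy fields do not reproduce those of Figure 1.

```python
from itertools import combinations

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def find_fathers(child, previous_fields, d0=0.0, max_union_size=3):
    """Return (names of the best-matching union of previous fields, its similarity)."""
    # A field sharing no term with the child can only enlarge the union,
    # so it can never improve the similarity and is skipped.
    candidates = [k for k, terms in previous_fields.items() if terms & child]
    best, best_score = frozenset(), 0.0
    for size in range(1, min(max_union_size, len(candidates)) + 1):
        for names in combinations(candidates, size):
            union = set().union(*(previous_fields[k] for k in names))
            score = jaccard(union, child)
            if score > best_score:
                best, best_score = frozenset(names), score
    if best_score < d0:          # no sufficiently similar father: leave the field unmatched
        return frozenset(), best_score
    return best, best_score

# Hypothetical fields; terms are represented by integers.
previous = {"A": {1, 2, 3, 4, 5}, "B": {6, 7, 8}}
field_c = {1, 2, 3, 9}           # A minus two terms, plus one new term
field_d = {4, 5, 6, 7, 8}        # mixes terms of A and B
fathers_c = find_fathers(field_c, previous)
fathers_d = find_fathers(field_d, previous)
unmatched_d = find_fathers(field_d, previous, d0=0.7)
```

In this toy case the first field inherits from A alone (similarity 0.5), the second from the union of A and B (similarity 0.625, against 0.25 for A and 0.6 for B), and raising the threshold to 0.7 leaves the second field without a father.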

5 Phylogenetic Patterns 

We performed phylogeny reconstruction on the MedLine database, focusing on research
in biological and biomedical fields related to network studies. After the constitution of
a database concerning a set C of 834 terms (cf. materials and methods), we generated
maps processed on four-year sliding time windows from 1987 to 2007. An example of
these maps is given in Figure 2.

Figure 2: Detail of the map related to network studies in biology obtained on the period
2004-2007 for terms list C (see material and methods). Clusters were required to have at
least 4 terms. Clusters are labeled with their two most generic terms and merged with
clusters with the same label. Size of text and bubbles map the density of clusters. This
value is averaged over the set of merged clusters if the node is made of merged clusters.
The value of the link between two sets of merged clusters is the maximum value of the
inter-cluster similarity between all pairs of clusters (Visualized with Gephi.org). The inset gives
a detail of the cluster labeled inhibitor / apoptosis plotted in the $(I^\alpha_s(w), I^\alpha_g(w))$ space.
Sizes of the bubbles map the number of co-occurrences with other terms of the cluster,
and colors map the growth rates of these numbers compared to 2004-2007.
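
For illustration, the succession of time windows can be generated as follows. The one-year step between consecutive windows is an assumption, since only the four-year width and the overall period are stated; the function name is hypothetical.

```python
def sliding_windows(first_year, last_year, width=4, step=1):
    """List (start, end) windows of `width` years, from the most recent to the oldest."""
    windows = []
    end = last_year
    while end - width + 1 >= first_year:
        windows.append((end - width + 1, end))
        end -= step
    return windows

windows = sliding_windows(1987, 2007)
```

Under this assumption the most recent window is 2004-2007, the oldest is 1987-1990, and one map is produced per window.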

We reconstructed the phylogeny of the domains related to network studies in biology
over the period 1987-2007 and studied the patterns of three indexes of cluster
structuration: the density, the pseudo-inclusion and the empirical quality. Releasing all
constraints on the phylogeny except that we required the fields to have at least four elements
and a non-null empirical quality, we obtained a phylogenetic network of 7759 nodes.

Within this network, we observed a significant positive correlation between the
pseudo-inclusion index and the empirical quality. The Pearson coefficient r lies within the 95%
confidence interval [0.14; 0.19], the probability of obtaining a correlation as large as the
observed value by chance being $p = 4 \cdot 10^{-39}$. Between the pseudo-inclusion index
and the number of papers per cluster we get 0.28 < r < 0.32 and p = 0. To a lesser
extent, there is a significant positive correlation between the density index and the
empirical quality (0.03 < r < 0.08, $p = 4 \cdot 10^{-6}$) as well as with the number of papers per
cluster (0.16 < r < 0.21, p = 0).
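
For readers who wish to run this kind of analysis on their own data, the sketch below computes a Pearson coefficient and a 95% confidence interval with the Fisher z-transformation. How the intervals above were actually obtained is not stated, so this method is an assumption; the value r = 0.165 is merely the midpoint of the first reported interval, used here with n = 7759 as a plausibility check.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def fisher_confidence_interval(r, n, z_crit=1.96):
    """Approximate confidence interval for r from n observations (1.96 gives 95%)."""
    z = math.atanh(r)
    half_width = z_crit / math.sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)

low, high = fisher_confidence_interval(0.165, 7759)
```

With these inputs the bounds round to 0.14 and 0.19, which is compatible with the interval reported above for a network of this size.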

We categorized the fields according to their position in the phylogenetic network: aborted
(no father, no child), newcomers (no father, some children), adult (with father(s) and
son(s)) and dying fields (with some father(s) but no child). Note that a cluster may
belong to a different category depending on the value of $d_0$. The distribution of scientific
fields across these categories is particularly interesting. Plotting the variations of
the fields' pseudo-inclusion index (Fig. 3a) against this categorization, we found very clear
patterns in the domain $0.3 < d_0 < 0.6$:
aborted, newcomer and dying fields tend to have weaker indexes than adult fields, with
aborted fields having slightly lower values for their indexes than newcomers. Similar
patterns have been obtained for the density and empirical quality (cf. Appendix 1).
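
This categorization follows directly from the father-to-child links of the phylogenetic network, as the following sketch shows. The field names and links are hypothetical.

```python
def categorize_fields(fields, edges):
    """Classify each field from (father, child) links of the phylogenetic network."""
    with_father = {child for _, child in edges}
    with_child = {father for father, _ in edges}
    categories = {}
    for field in fields:
        if field in with_father and field in with_child:
            categories[field] = "adult"
        elif field in with_child:
            categories[field] = "newcomer"
        elif field in with_father:
            categories[field] = "dying"
        else:
            categories[field] = "aborted"
    return categories

# Hypothetical chain a -> b -> c, plus an isolated field d.
categories = categorize_fields(["a", "b", "c", "d"], [("a", "b"), ("b", "c")])
```

Since the set of links depends on the threshold, recomputing the categories for several values of the threshold is enough to check how stable the classification of a given field is.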

The dependency of the mean of the density, pseudo-inclusion and empirical quality
indexes on the position of the fields in the phylogeny suggests trends in the "life cycle"
of scientific fields: these indexes grow while a new field emerges, and then lose their
strength when it begins to be neglected by the community. However, the density and
pseudo-inclusion indexes are completely different ways of characterizing scientific fields. On the
one hand, fields with high pseudo-inclusion will usually have terms with a large spectrum
of specificity and genericity, which means that they are likely to contain very specific
terms with few occurrences. These terms have a high probability of being new concepts
or new objects of study. Their presence in the phylogeny will then be correlated with
a high rate of branching processes. On the other hand, fields with a high density index
correspond to well-structured scientific domains with an a priori lower rate of conceptual
renewal.

Figure 3: Dependencies of the mean of the pseudo-inclusion index on the position of the fields
in the phylogeny (a) as well as on their number of sons (b) suggest trends in the "life cycle"
of scientific fields: these indexes grow while a new field emerges in bushy branches, and
then lose their strength when it begins to be neglected by the community. As shown,
these patterns are robust against variations of $d_0$ in the domain $0.3 < d_0 < 0.6$. Error bars
indicate the 95% confidence interval.

Further studies based on different databases are needed to confirm or refute the relevance
of these general patterns in the study of science evolution. However, these regularities
open perspectives for the detection of emergent or dying fields on the basis of some
indexes computed on co-occurrence data.

Besides, the fact that aborted fields tend to be of lower quality suggests a methodology
for optimally adjusting $d_0$ in order to obtain the most informative phylogeny (in the sense of
the empirical quality). Indeed, the ratio between the mean quality of fields belonging
to the phylogeny and the mean quality of aborted fields is always higher than 1, and
reaches its maximum around the value $d_c = 0.33$. For this value, connected fields in the
phylogeny, i.e. fields that have at least one father or one son, are on average almost twice
as informative as aborted fields.
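
The adjustment procedure can be sketched as a scan over candidate thresholds. The sketch assumes that the best father-to-child matches and their similarities have been computed once, so that the threshold only prunes links; the qualities, links and function names are hypothetical.

```python
def quality_ratio(quality, scored_edges, d0):
    """Mean quality of connected fields over mean quality of aborted ones, or None."""
    connected = set()
    for father, child, score in scored_edges:
        if score >= d0:
            connected.update((father, child))
    linked = [q for field, q in quality.items() if field in connected]
    aborted = [q for field, q in quality.items() if field not in connected]
    if not linked or not aborted:
        return None                      # ratio undefined for this threshold
    return (sum(linked) / len(linked)) / (sum(aborted) / len(aborted))

def best_threshold(quality, scored_edges, candidates):
    """Candidate threshold maximizing the quality ratio, or None if never defined."""
    ratios = [(quality_ratio(quality, scored_edges, d0), d0) for d0 in candidates]
    ratios = [(ratio, d0) for ratio, d0 in ratios if ratio is not None]
    return max(ratios)[1] if ratios else None

# Hypothetical empirical qualities and (father, child, similarity) links.
quality = {"a": 0.9, "b": 0.8, "c": 0.1, "d": 0.2, "e": 0.7}
scored_edges = [("a", "b", 0.6), ("c", "d", 0.2), ("b", "e", 0.4)]
d_c = best_threshold(quality, scored_edges, [0.1, 0.3, 0.5])
```

On this toy network the intermediate threshold 0.3 is selected: a lower value connects every field, so no field is aborted and the ratio is undefined, while a higher value discards a link between fields of good quality.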

Inheritance patterns can be studied by classifying fields according to their number
of sons in the phylogenetic network. While most fields have fewer than 2 sons, with 44%
having only one successor, almost 14% have at least 3 children. Again, the distribution of
the different indexes as a function of the number of children is very instructive. Figure 3b
shows that, on average, the maximum of density is obtained for fields that have only one
son. Similar patterns have been obtained for the pseudo-inclusion and the number of
papers per cluster (cf. Appendix 2). Again, this observation holds for a large range of
$d_0$. The synthesis of all these results suggests that relatively young branches of science
are generally bushy, with fields having lots of children. This corresponds to an intense
exploration of new directions of research. Older fields will generally have a much more
linear evolution with a lower rate of conceptual renewal. This pattern can clearly be
observed in Figure 4, which represents, for $d_0 = d_c$, the subpart of the phylogenetic network
composed of fields with the highest empirical quality and at least four terms. Most recent
branches have also been removed to meet editorial constraints. We can also notice that
there is much more hybridization between scientific fields in the domain of formal methods
and tools than in the branches corresponding to topics in biology. This transversal
domain is also over-represented because the targeted theme (networks) is
itself a transversal methodology.

Details of the phylogeny are also very informative. Figure 5 represents the phylogeny 
with fields of more than five terms for which at least one term contains the words "cancer" 
or "tumor" . On this partial phylogeny, we can clearly see three distinct sets of branches 
with very different characteristics. Two sets are quite bushy and deal with cancer and
DNA issues on one side, and cancer, tumor and proliferation issues on the other. They
appear to have increased their interactions these last several years around the concepts 
of apoptosis, suppressor and cell cycle. The third set has very linear branches and is 
related to the relations between tumor and the immune system. These three sets are 
also quite distinct in terms of the range of their density and pseudo-inclusion indexes. 
Whereas the bushy branches tend to have a higher pseudo-inclusion index than the linear 
ones, revealing a higher rate of conceptual renewal, they also have a lower density index, 
indicating that they should be more recent. The study of the evolution of the pseudo- 
inclusion index along these branches reveals that this index is increasing along most 
of the branches although its growth rate is decreasing with time. When relaxing the 
constraints on the empirical quality threshold and on the number of terms in clusters, 
these characteristics regarding the three sets of branches are preserved, although the 
branches prove to be older than they appear in this partial phylogeny, the upper-part of 
the phylogeny having been pruned in the thresholding process. 

6 Toward quantitative epistemology 

The seminal work of Callon et al. [ ] was the first attempt to quantify the evolution
of scientific fields through co-word analysis, monitoring, inter alia, the evolution of the
density of clusters. Our work proposes the first automated methods for the bottom-up
reconstruction of the entire phylogeny of a domain of science and is clearly in line with
their approach. We expanded their approach in several ways, trying to take into account
the classical limitations of scientometrics that have been expressed hitherto.
Coverage: Co-word analysis can cover the largest bibliographic databases available.
Nowadays, online publishers cover between 30 and 40 million articles, which represent a
significant part of worldwide scientific literature. We gave an example on a case study
based on MedLine (14M papers) covering most medical and biological research.
Ambiguity: Contrary to [ ] and most subsequent works, we used overlapping clustering
algorithms in order to handle ambiguity in the use of terms and to avoid false negatives
in the detection of scientific fields, e.g. terms that are classified in different
clusters although they are strongly related.

Asymmetry and bottom-up multi-level mapping: Following previous work [ ],
we based our clustering algorithm on an asymmetric proximity measure in order to fully
reflect the organization of science into domains and sub-domains. This asymmetry makes
it possible to highlight the internal structure of clusters, allowing automatic labeling [8] in
a bottom-up way (similarly to [ ] but contrary to top-down labeling, e.g. [14], [1] or [ ],
who use ISI journal classification to label clusters). This offers possibilities of multi-level
mapping with multiple viewpoints on the phylogeny according to the required degree
of specificity. We also introduced a measure of field structuration, the pseudo-inclusion
index, based on this new asymmetric proximity, and we showed that the pseudo-inclusion
index appears to be very informative when assessing the evolution of a field of research.

[Figure 4 panel labels: Cellular Biology; Immunology]

Figure 4: Extract of the full phylogeny of domains related to network studies in biology
and medical research. We kept fields made of more than four terms, set a threshold
on the empirical quality (0.04) and removed the shortest branches for editorial purposes.
Some branches have been grouped, compared to the GraphViz display, on the basis of their
theme. Colors map the pseudo-inclusion index of the fields. Fields are labeled with
their most generic term, except for the beginning of a branch or for the most recent
period, where all terms are displayed. The labels of inter-period arrows indicate which
terms have been lost or gained between two periods. The number on the first line of a
field label is the field id; the number on the last line is the number of articles mentioning
all terms of the field in the reference database. Zoom in to see details.

Figure 5: Detail of the sub-phylogenetic network related to cancer studies. Colors of 
the circles, from blue to red, map the growth rate of the pseudo-inclusion index. Red 
links indicate the introduction of at least one new term. Note that this index is increasing 
along most of the branches (warm colors) although its growth rate is decreasing with 
time. Fields are labeled with their most generic term, except at the beginning of a 
branch or for the most recent period, where all terms are displayed. The labels of inter- 
period arrows indicate which terms have been lost or gained between two periods. In 
cluster labels, the number on the first row indicates the cluster id and the number on the last 
row indicates the number of articles mentioning all terms of the cluster. 

Validation: Complementary to [10], who suggested using both "internal validation" 
(i.e. by experts of the domains) and "external validation" (i.e. by users of the maps), 
and [11], who proposed a method to assess the stability of a clustering, we proposed an 
empirical validation of science maps (confrontation with real data) that complements 
these approaches. We introduced the empirical quality, which reflects the amount of 
information conveyed by a cluster about actual scientific activity, and showed that the 
pseudo-inclusion index was positively correlated with the empirical quality. The density, 
on the other hand, was only weakly correlated with it. 

Dynamics: The proposed methodology capitalizes on the availability of diachronic 
data to reconstruct the phylogeny of scientific fields, and takes into account multiple 
filiations, contrary to what has been done in related areas such as social group 
evolution [ ] or [ ]. The reconstructed science phylogeny revealed robust 
patterns that point to strong regularities in science evolution. 
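
As an illustration of what "multiple filiations" means in practice, the following minimal sketch links fields of two successive periods and lets a field have several parents. It is not the reconstruction method used in this article: the Jaccard overlap, the `min_similarity` threshold and all function names are assumptions chosen only for the example. The `gained` and `lost` sets correspond to the labels of inter-period arrows described in the captions of Figures 4 and 5.

```python
from typing import Dict, FrozenSet, List, Tuple

Field = FrozenSet[str]  # a field is represented by its set of terms


def jaccard(a: Field, b: Field) -> float:
    """Share of terms common to both fields (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def link_periods(
    earlier: Dict[str, Field],
    later: Dict[str, Field],
    min_similarity: float = 0.3,  # hypothetical threshold
) -> List[Tuple[str, str, dict]]:
    """Return (parent_id, child_id, label) links between two periods.

    A child keeps every parent above the threshold, so a field can have
    several parents (merging) and a parent several children (splitting).
    """
    links = []
    for child_id, child in later.items():
        for parent_id, parent in earlier.items():
            if jaccard(parent, child) >= min_similarity:
                label = {
                    "gained": sorted(child - parent),
                    "lost": sorted(parent - child),
                }
                links.append((parent_id, child_id, label))
    return links


# Toy data: two fields of period t merge into one field of period t+1.
period_t = {
    "a1": frozenset({"network", "hub", "protein"}),
    "a2": frozenset({"feedback", "regulation", "gene"}),
}
period_t1 = {
    "b1": frozenset({"network", "hub", "protein", "gene", "regulation"}),
}
for parent, child, label in link_periods(period_t, period_t1):
    print(parent, "->", child, label)
```

In this toy example the field `b1` inherits from both `a1` and `a2`, which a tree-like reconstruction with a single parent per field could not represent.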

This approach opens perspectives from both theoretical and applied points of 
view. While we tried to show that research on the reconstruction of science dynamics 
is close to the point where it will make it possible to corroborate or falsify theories in 
epistemology and science studies, we can also expect it to considerably renew the 
way we interact with science, especially when browsing large-scale electronic databases. 
Moreover, the methodology presented here is not specific to scientific corpora and may be 
applied to a wide range of co-occurrence data from online communities, patent databases, 
folksonomies, web queries or even experimental data like microarray data. 

7 Appendix 

7.1 Indexation: from corpus to data 

In order to propose scalable methods on raw data, we considered indexes of science 
databases, such as those already built by search engines, as proxies for the evolution 
of science. Our method thus copes with the constraint of working with aggregated 
co-occurrence data of terms in articles. Other methods bring interesting complementary 
perspectives on the dynamics of epistemic communities but require more detailed access to 
data sets (author-based data, for example [20]). 

Co-word analysis critically depends on the initial set of terms chosen for the study 
and can be biased by the "indexer effect" ([24], [5], [9]). This effect can have several 
origins: terms selected by the indexers are too general, specific terms have been omitted 
from the list, or the indexer puts the wrong emphasis in keywording. For the case study 
presented in this paper, we chose a semi-automatic method that takes advantage of both 
powerful automated parsing of large corpora and expert skills to minimize this effect. 
We also chose to index terms within abstracts or full text of articles rather than in 
keyword lists provided by publishers or authors. 

The case study presented in this article targets the question of networks in medical 
and biological research. We chose PubMed-MedLine as data source since it covers most 
of the publications in biology (more than 17M references), while titles and abstracts 
of articles are freely available. We then chose a few concepts related to network-based 
approaches (network, evolvable, evolvability, hub, feedback) and retrieved all the papers 
mentioning at least one of these terms in MedLine (about 2.4M references). We then 
indexed these 2.4M abstracts with date of publication and retrieved all n-grams (key 
phrases with exactly n terms) with a number of occurrences higher than $100^{1/n}$ and 
$n \le 3$ over the whole period (e.g. the term protein interaction network has to appear 
in at least 5 references to be included in our set of candidate keywords). Stop words 
were discarded. This list of terms was then checked by science historians to further 
discard uninformative terms, which finally led to a set C 
of 834 terms (available at http://www.maps.sciencemapping.com/eprint/phylo/appendix3.txt). 
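
The automated part of this selection can be sketched as follows. This is only an illustrative sketch, not the parser actually used: the tokenizer, the rule that drops n-grams starting or ending with a stop word, and the choice of counting each n-gram once per abstract are all assumptions, and the occurrence threshold is passed in as a function of n (`min_count`, a hypothetical parameter) so that any threshold rule can be plugged in.

```python
import re
from collections import Counter
from typing import Callable, Iterable, List, Set


def ngrams(tokens: List[str], n: int) -> List[str]:
    """All key phrases made of exactly n consecutive tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def candidate_terms(
    abstracts: Iterable[str],
    stop_words: Set[str],
    min_count: Callable[[int], float],  # threshold as a function of n
    max_n: int = 3,
) -> Set[str]:
    """Return the n-grams (n <= max_n) found in enough abstracts."""
    doc_freq: Counter = Counter()
    for text in abstracts:
        tokens = re.findall(r"[a-z]+", text.lower())  # naive tokenizer
        for n in range(1, max_n + 1):
            # set(): count each n-gram at most once per abstract
            for gram in set(ngrams(tokens, n)):
                doc_freq[(n, gram)] += 1

    kept = set()
    for (n, gram), count in doc_freq.items():
        words = gram.split()
        # assumed stop-word rule: reject phrases bounded by a stop word
        if words[0] in stop_words or words[-1] in stop_words:
            continue
        if count >= min_count(n):
            kept.add(gram)
    return kept


# Toy corpus with a toy threshold of 2 abstracts for every n.
toy_abstracts = [
    "The protein interaction network is robust",
    "A protein interaction network of yeast",
    "Feedback in the network",
]
toy_stop_words = {"the", "a", "is", "of", "in"}
print(sorted(candidate_terms(toy_abstracts, toy_stop_words, lambda n: 2)))
```

The resulting list is only a set of candidates; in the procedure described above it is then checked by hand to discard uninformative terms.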

These terms were then indexed from 1950 to 2008 in the 2.69M retrieved abstracts 
to build the co-occurrence array $A_t$ of all co-occurrences for terms in C. 
$A_t(i,j)$ gives the number of articles published during the year t which mentioned 
both terms i and j in their abstract. 
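
A minimal sketch of how such yearly co-occurrence counts can be built is given below. It assumes that each article has already been reduced to a publication year and the set of terms found in its abstract; the function names and the data layout (one counter of unordered term pairs per year) are choices made for the example, not a description of the software used here.

```python
from collections import Counter, defaultdict
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple

Pair = Tuple[str, str]


def cooccurrence_by_year(
    indexed_articles: Iterable[Tuple[int, Set[str]]],
    vocabulary: Set[str],
) -> Dict[int, Counter]:
    """For each year, count the articles mentioning both terms of a pair."""
    counts: Dict[int, Counter] = defaultdict(Counter)
    for year, terms in indexed_articles:
        # keep only terms of the vocabulary, in a canonical (sorted) order
        present = sorted(set(terms) & vocabulary)
        for i, j in combinations(present, 2):
            counts[year][(i, j)] += 1
    return counts


def lookup(counts: Dict[int, Counter], year: int, i: str, j: str) -> int:
    """Symmetric access: the order of i and j does not matter."""
    key: Pair = (i, j) if i < j else (j, i)
    return counts[year][key] if year in counts else 0


# Toy data: three indexed articles over two years.
toy_vocabulary = {"network", "hub", "feedback"}
toy_articles = [
    (2000, {"network", "hub"}),
    (2000, {"network", "hub", "feedback", "cell"}),
    (2001, {"network", "feedback"}),
]
toy_counts = cooccurrence_by_year(toy_articles, toy_vocabulary)
print(lookup(toy_counts, 2000, "network", "hub"))
```

Storing only pairs that actually co-occur keeps the structure sparse, which matters when the vocabulary contains several hundred terms and the period spans several decades.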



7.2 Software 

We developed and used the Words Evolution software (http://sciencemapping.com/WE) 
to process and visualize the phylogenies. This software is interfaced with network 
visualization tools like Gephi or Graphviz as well as clustering software like CFinder. 

7.3 Supplementary figures 

Figure 6: Dependence of the mean density and empirical quality on the position 
of the fields in the phylogeny. As shown, these patterns are robust against variations in 
the domain 0.3 < do < 0.6. Error bars indicate the 95% confidence interval. 




Figure 7: Dependence of the mean density of clusters and of the number of articles 
on the number of sons. As shown, these patterns are robust against variations 
in the domain 0.3 < do < 0.6. Error bars indicate the 95% confidence interval. 